A SUBSIDIARY OF 
LOCKHEED CORPORATION 


Lockheed 

Electronics 


79-1 ft T ^7 7 . 


JSC-12700 


1830 NASA Road 1, Houston, Texas 77058 
Tel. 713-333-5411 


Company, Inc. 


Ref: 642-6932 

Contract NAS 9-15800 
Job Order 73-715-12 


TECHNICAL MEMORANDUM 

REGRESSIONS BY LEAPS AND BOUNDS AND BIASED ESTIMATION 
TECHNIQUES IN YIELD MODELING 

By 

N. Marquina 


Approved By: 


-7. C. 



T." C. Minted Supervisor 
Techniques Development Section 


CE79-10177) BSGBES SIGNS BY LEAPS AND BOUND! 
AND BIASED ESTIMATION TECHNIQUES IN YIELD 
MODELING (Lock&eed Electronics Co,) 81 p H( 
A05/MP A01 CSCE 12j 


N 79— 20448 


Unclas 

G3/43 00177 


February 1979 


LEC-12379 



CONTENTS 


Section Page 

1. INTRODUCTION 1-1 

2. ORDINARY LEAST-SQUARES ESTIMATION 2-1 

2.1 LEAST-SQUARES ESTIMATION 2-1 

2.2 MULTICOLLINEARITY 2-4 

3. CHOOSING THE BEST REGRESSION 3-1 

3.1 ALL POSSIBLE REGRESSIONS 3-1 

3.1.1 FURNIVAL'S METHOD OF GENERATION 3-2 

3.1.2 BASIC ASSUMPTION 3-2 

3.2 ADJUSTED R 2 3-4 

3.3 MALLOWS' C p STATISTIC 3-4 

4. ALTERNATIVES TO ORDINARY LEAST-SQUARES ESTIMATION 4-1 

4.1 RIDGE AND GENERALIZED RIDGE 4-10 

4.2 MARQUARDT'S GENERALIZED INVERSE ESTIMATOR 4-19 

4.3 SHRUNKEN ESTIMATORS 4-26 

4.4 PRINCIPAL COMPONENTS REGRESSION 4-31 

4.5 LATENT ROOT REGRESSION . 4-34 

5. APPLICATIONS 5-1 

6. CONCLUSIONS AND RECOMMENDATIONS 6-1 

7. 'REFERENCES 7-1 


t fcvC.'U’,' 


EDING PAQHSBUXi! K5X FILMED' 


TABLES 


Table Page 

I LEAST SQUARES AND MODIFIED LEAST SQUARES 4-48 

II A' A, THE EXTENDED CORRELATION MATRIX 4-48 

III EIGENVECTORS OF A'A 4-49 

IV INDEXES FOR STANDARDIZED PREDICTION EQUATIONS 4-49 


FIGURES 


Figure Page 

1 The regression tree 3-3 

2 Cp plot . . . 3 t6 

3 Two-dimensional example 4-9 

4 Nonpredictive multi coll inearity 4-40 

5 Predictive multi col linearity 4-41 

6 Vertical norm 4-43 


v 



SYMBOLS 


p Total number of regressor variables 

n Number of observations 

x Independent or regressor variables 

X- The n-dimensional vector of observed values of the j th regressor 

J variable 

X The n x p matrix whose columns are the vectors 

(X'X) + Moore-Penrose pseudoinverse of the matrix X 1 X 

y The n-dimensional vector of observed values of the dependent 

variable 

Y.j The i th observation of the dependent variable 
Y Average of the components of y 

Y.j Estimated value of Y^ 

g True parameters or regression coefficients 

b Estimate of g 

e The error term 

a Variance of the components of error 

l- The j th eigenvalue of X 1 X in order of increasing magnitude 

J 

V. A normalized eigenvector of X'X associated with £. 

J J 

S' The matrix whose columns are the eigenvectors 

L The diagonal matrix of eigenvalues SL. 

J 

S r Submatrix of S consisting of r columns 
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REGRESSIONS BY LEAPS AND BOUNDS AND BIASED ESTIMATION 
TECHNIQUES IN YIELD MODELING 


1. INTRODUCTION 

The prediction of yield estimates based on meteorological variables is 
discussed in this technical memorandum. The primary statistical tool for 
the analysis is linear parameter regression. Multiple linear regression 
analysis is a procedure for the analysis of the relationships between two 
sets of variables, independent or regressor variables and dependent or 
response variables, whose values are believed to be related to the set. 
Estimation of the coefficients of the regression model is usually performed 
using least squares. 

The least-squares estimator of the regression coefficients has the desirable 
property of being unbiased and having minimum variance among the class of 
unbiased linear estimators. However, when near-linear relationships exist 
among the regressor variables (a situation known as multi coll inearity) this 
minimum variance can be quite large. Thus, estimation procedures other than 
least squares appear to be desirable when multi col linearity exists among 
the regressor variables. 

The meteorological variables currently used in yield modeling are highly 
correlated among themselves. A consequence of the multi coll inearity present 
in the meteorological variables is the large variance of the regression 
coefficients. Many of the applications of regression analysis in yield 
modeling either explicitly or implicitly place reliance on individual 
parameter estimates. Inferences about cause-effect relationships between 
the response and regressor variables based on individual coefficient estimates 
can be misleading, even erroneous, when multicoll inearity is present in 
.the data. In the presence of multicoll inearities , the estimated coefficients 
are highly unstable; the addition of one or more new observations can change 
the size and even the sign of some of the parameters (ref. 1). 


1-1 


I 



In this memorandum it will be shown that techniques other than ordinary 
least squares (OLS) exist to deal with the problem of estimation with 
correlated predictor variables. In particular, latent root regression, 
principal components regression, ridge, and generalized ridge will be 
examined. Texas and Oklahoma weather data are used for the illustrations. 

The programs that implement these techniques were developed by the author 
while attending the University of Houston; the Industrial Engineering 
Department of the University of Houston provided the computer time for the 
sample runs. 

In section 2, ordinary least squares are reviewed and the multicollinearity 
problem is defined. Section 3 deals with the problem of finding the best 
subset of variables to enter the regression analysis. Section 4 discusses 
three of the most important biased estimation techniques. In section 5, 
these ideas are applied to the weather and trend data for the Texas- 
Oklahoma Panhandle and Oklahoma, which were provided by National Aeronautics 
and Space Administration, Lyndon. B. Johnson Space Center (NASA/JSC) personnel. 
Conclusions and recommendations are presented in section 6. ■ References are 
listed in section 7. 



2. ORDINARY LEAST SQUARES 


2.1 LEAST -SQUARES ESTIMATION 


The multiple linear regression model can be written as 


Y i ' e o + T*h + <&h + ••• + s p x ip 


+ s- 


i = 1 » 2, 


, n 


( 1 ) 


where 


V. = the yield for the i th year 

6* B?, •••, B* = unknown parameters referred to as the regression 
u i p 

coefficients 

x*. = the value of the jth weather variable for the Hh year 

l J 

. = random error term for the Hh year 


For the purposes of increased computational accuracy, the relationship 
(eq. (•!.)) can be transformed (ref. 2} by letting 



where 


V* = 

x j 


n 

£ 

i=l 




The predictor variables are now standardized so that 


and 




0 


II ^ 

^ 1 x.. = 1 j j — 1,2, ***,p 
i=l 1J 
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Eq. (1) now becomes 


Y i 3 0 + 3 l x il + *' 


' + 3 p x ip + £ i 


1 = 1 , 2 , 


where 


e 0 - + ®M 

and 


b j 



2 


11/2 


In vector notation, eq. (2) is written as 

y = e 0 l + Xp + e 

where 

y = a'n n x 1 vector of yield measurements 

1_ = an n x 1 vector of Ts 

X = [X-j, X. 2 , ***, Xp] = an n x p matrix of constants 

3' = ((3-j, ***» 3p) = a 1 x p vector of unknown parameters 

= an n x ] vector of random error terms 


( 2 ) 


( 3 ) 


The following assumptions are used in this memorandum: 

a. The elements of X = [X-., X-, ♦•*, X ] are nonstochastic. 

I c. p 

b. X has rank p < n. 

c. The elements of y are observable random variables. 

d. The elements of e are unobservable random variables with E[e] = £ and 
E[ee'] = o 2 I n . 

2 

e. In addition, the assumption e~ N(£, o I n ) will be included when hypothesis 
testing is required. 
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It should he noted that assumption (b) states that the number of observations 
exceeds the .number of parameters to be estimated and that no exact linear 
relations exist among the columns of X. 

The least-squares estimate b of 3 is found by minimizing e'e with respect 
to 3 , i . e . , 

minimize e'e = (y - X3) ' (y - X3) 

= y'y - 23X'y + 3’X'X3 


(4) 

y'X3 s differentiating e’e 


These are the so-called normal equations. The solution to eq. (4) is given by 

b = (X'XrVy (5) 

The least-squares estimator is unbiased and has minimum variance in the class 
of unbiased estimators of the regression coefficients. If the normality 
assumption (e) is valid, eq. (5) is also maximum likelihood (ref. 3). 

The variance of the OLS estimator is given by 

Var(b) = o 2 (X'X) -1 

The variance of the estimator of a particular coefficient, b., is 

J 

■ - s/ 

while Cov(b. ,b.) = C-.a 2 

I J * J 


( 6 ) 

(7) 


The problem is reduced to solving 

X'XB = X'y 

which is obtained by using the fact that- 3'X‘y = 
with respect to 3, and solving '• 


or . -2X l y + 2X'X3 = 0 
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where 


C = [C.j] = (X'X ) -1 

Eqs. (6) and (7) play a central role in the discussion of multicollinearity. 

A complete discussion of the derivation of the least-squares estimator and 
its various numerical and statistical properties can be found in many 
standard texts on statistical analysis; special mention should be made of 
reference 2. 

2.2 MULTICOLLINEARITY 

It is very well established that the regressor weather variables in the Center 
for Climatogical and Environmental Assessment (CCEA) yield model are highly 
correlated (refs. 4 and 5). 

Let us analyze the effects of multicollinearity on the OLS estimator. 
Multicollinearity is .a form of ill-conditioning within the matrix, X, of 
regressor variables in which for some set of constants , a^ •••, ap not 
all zero, we have 

P 

(8 > 

j=l 

If the relationship is exact, there is said to be an exact multicollinearity 
among the regressor variables. In this case, (X'X)~^ does not' exist 
because the rank of X will be less than p. This implies that there is not 
one solution to the normal equations, but infinitely many solutions. To 
obtain a unique solution for eq. (4) when the rank of X is less than p, the 
Moore-Penrose pseudoinverse of X'X, (X'X) , should be. used (ref. 6) to obtain 
the solution 

b + * (X'X) + X'Y • 

Of primary concern in this paper are the cases where eq. (8) only approximates 
zero. When this occurs, we say that multi coll ineari ties exist among the 
regressor variables. Multicollinearity is explained in greater detail in 
references 7, 8, and 9. 


2-4 


(p 



Strong mul ti col 1 ineari ties among the regressor variables produce the following 
problem with OLS estimation of the regression coefficients: 

a. The estimates tend to be large in magnitude. 

b. The signs of the estimates are greatly influenced by the -mul ticoll ineari ty, 
which can result in estimates having signs which disagree with known 
theoretical (agricultural )- properties of the model. 

c. Variances and covariances of the .estimators tend to be extremely large, 
often' causing the experimenter to delete variables incorrectly. 

d. The coefficient estimates are very sensitive to the particular set 
of sample data, therefore the addition of a few more observations 
can cause large changes in the estimates. 

These problems are due entirely to the presence of mul ticol linearities and 
occur regardless of the true values of the regression coefficients. The 
difficulty centers around the fact that multicollinearities among the 
regressor variables cause X'X to be nearly singular. This, in turn, creates 
large values among the elements of (X'X) - . 


To illustrate these properties, suppose a linear relationship of the form 

shown in eq. (8) holds for the first k < p regressor variables with a- 

~ 3 

nonzero. .The diagonal elements of 


C = (X'X) 


-1 


can be expressed as 


Cjj = 1 ~ " 


-1 


j = 1, 2, 


where is the coefficient of determination of the least-squares regression 

of x. on the remaining p - 1 regressor variables. If j < p, x.. is involved 
J ~ J 

in the mul ti coll ineari ty and hence could be well estimated by the remaining 

2 

regressor variables. This results in an R. which is very close to 1 and 
consequently a C . . which is very large. Since 

J J 


Var < b 3J> ‘ C d/ 
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the variance of the estimator of the regression coefficient of x. is very 

-1 J 

large. The off-diagonal elements of (X'X) can be represented as 


C. . 
ij 


- S iU P -2)/\b - R J.(p-2)]i 

where S.. , is the partial covariance of x. and x. adjusted for the 

1 j . 2 * j 


and R t is the coefficient of determination 

J. (p-2) 


remaining p-2 variables. 

for the regression of x. on the remaining p-2 variables, excluding x- . Thus, 

J 2 . 1 

if x. and x. are both involved in the mul ticoll inearity, R. will be close 

I J 


to 1 while S.j ^ 2 ) generally will not be close to zero. 

Cov < b i" bp - C ij0 2 


Therefore, 


typically will be large in magnitude. 


The least-squares estimator of the individual regression coefficient can 
be written as 

P 

b. = y\ C - .(XI ) i = 1, 2, p 
• >1 

Hence, if x. is one of the variables involved in the mul ticoll inearity, 

several of the C-. will tend to be large. in magnitude, in turn yielding a 
^ J 

3- which is large in magnitude. This is due primarily to the multicollinearity 
and does not necessarily reflect the true values of the regression param- 
eters 3- 

If we let £-j < Jig < ••• < £p be the latent roots or eigenvalues of X'X as 
defined by the equation 


|X'X - y p | =. o 

and let Vp V^, •••, V be -corresponding latent vectors or eigenvectors 
of X'X as defined by the equation 

(X'JOVj - 
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subject to the constraints 


and 



= 1 


ViVj =0 ; f f j 

then we can write 


P • 

C = (X'X) -1 - 2 (9) 

j=l 

Equation (9) provides another way of illustrating the problems with least- 

squares estimation. The presence of multicollinearities means that X'X 

will be near singular and hence one or more of the eigenvalues, l . , will be 

-1 d 

near zero. This creates the large elements in (X'X) mentioned above. 


The problems associated with least-squares estimation motivate the need for 
alternative methods of estimation and analysis when confronted with multi- 
coil inear data. Several recently proposed alternatives are outlined in 
the next sections. Of necessity, all are biased estimators, but each will 
be seen to have several desirable as well as undesirable properties. 



3. CHOOSING THE BEST REGRESSION 


A major problem in regression analysis is. that of deciding which regressor 
or predictor variables should be in the model. There are two conflicting 
criteria for selecting a subset of regressors. First, the model chosen 
should include as many of the X's as possible if reliable predictions are 
to be obtained from the fitted equation. Second, as discussed in section 2, 
the variance of the predictor increases with the number of regressors. A 
suitable compromise between these two extremes is usually called "selecting 
the best subset" or "selecting the best regression equation." 

The CCEA model contains 23 predictor or regressor weather-related variables. 
This author was requested not to consider square or cross-product terms; such 
analysis should be done as part of the follow-on' to this study. 

It is recognized that individually the weather variables in the CCEA model 
provide little information but collectively they do reasonably well. Under 
these circumstances, it has been shown (ref. 10) that the all -subsets 
approach is much better than backward, forward, or stepwise regression when 
selecting a suitable 'subset of regressor variables. 

3.1 ALL POSSIBLE REGRESSIONS 

Algorithms have been described (refs. 11 and 12) for computing all possible 

regressions which are much superior to the naive approach involving the 

direct inversion of the moments matrix, associated with each subset of 

independent variables. The number of operations per regression decreases 
3 2 

from kp to kp . If less output for each regression is satisfactory, 
further savings are possible. By computing the regression coefficients, 
their variances, and the residual sum of squares with a number of operations 
per regression, which is of order p, and if we are satisfied with only the 
residual sum of squares (RSS), the number of operations .per regression can be 
reduced to slightly less than six (ref. 13). As there are two possibilities 
for each regressor, "in" or "out" of the equation, there are 2 P such 
regressions. 



A systematic procedure for generating all possible regressions is given in 
references 11, 12, and 14. Garside (refs. 11 and 14) represents each 
regression by a K-digit binary number; for example, if K = 4, the binary 

code 1010 would represent the model E[Y] = Bq + B-]X-| + £ 3 X 3 . For K = 3, we 

have 000, -100, 110, 010, 011, 111, 101, 001. These are the coordinates of 
the vertices of a K-dimensional hypercube; finding an efficient procedure 
is equivalent to finding a path along the edges of the hypercube which will 
pass through each vertex only once (a Hamiltonian walk). 

3.1.1 FURNIVAL 1 S METHOD OF GENERATION 

A Gaussian elimination method given by Furnival (refs. 13 and 15) is best 

described in terms of a "regression tree," as shown in figure 1. The 

Gaussian elimination operator is-applied to each- pivotal element just once 
in the order given by the binary tree. The full matrix is at the root of 
the tree, -and at each interior node a submatrix is derived from the parent 
matrix by a series of pivots (solid lines) and deletions (dashed lines). 

The regression tree can be traversed in any "biologically feasible" order; . 
the only restraint is that a father be "born" before his son. By using 
horizontal, vertical, or hybrid searching techniques, Furnival obtains a 
number of regression sequences which he describes as natural , lexicographic, 
binary, and familial. 

3.1.2 BASIC ASSUMPTION 

A number of authors have described procedures .for finding the best subset 
regressions without computing all possible regressions (refs. 16, 17, and 18). 
All of these methods are based on the fundamental inequality 

RSS(A) < RSS(B) 

where A is any set of independent variables; B is a subset of A; and RSS is 
root sum square. (See ref. 15 for more details.,) 
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3.2 ADJUSTED R 2 

One measure of goodness of fit of a regression model widely used in the past 
is the coefficient of determination 

r 2 - *> 2 
■ Xcv, - V ) 2 

2 

Since introducing an extra regressor increases R , the problem is not finding 

2 

the subset with maximum R (which in any case is the set of all p regressors) 

2 

but rather that of finding a suitable subset with a high R . 


2 

The adjusted or corrected R statistic is given by 


—2 

R = 1 - 

1 - R 2 


n 

m 

m 


n - m 


where m is the number of parameters in the model. 

— 2 

To see the effect on R of adding extra regressors to the equation, consider 
the F-statistic for testing the significance of q new additions (ref. 2): 

2 2 

f = m+q ~ R m n - m - q 

i - r 2 . q . 

m+q 

It follows that 


’ if and only if F > 1 . 


R 2 

m+q 


>? 

- m 


One criterion, therefore, for selecting the best regression is to choose the 

_2 

regression subset which maximizes R . 

m 

3.3 MALLOW'S C p STATISTIC 

Consider a q-parameter model, q < p: If = EfY^], then will generally 

differ from x^B^ because of possible bias in the q-parameter model. 


<3 
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E 

r - o" 

(v - e) 2 

= Var 

A . 


q 


q 


Let 6 = E[Y] - Then, for a given future data point _X, 

+ (N - e) 2 
q 

= cr 2 xL(^>0 _1 x + (N - 8) 2 
-q -q-q' -q q 

As pointed out in reference 19, it is perhaps more appropriate to use the 
sum or the average, in some sense, over the future observations of interest 

Mallows (refs. 20-22) and others (ref. 23) suggested to minimize 

n 


h = -V E 

q 2 
^ u 


2>qi - 6 i>‘ 


_i=l 

SSE5 


= q + 


where SSB is the -bias sum of squares, given by 

n 

SSB q - 2 <V' e i ,; 

i=l 

-It can be shown (ref. 20) that 


RSS 

c = a 

q- -2 

^ o 


+ 2q - n 


is a suitable estimate of A^ (fig'. 2). 


Notice that adding S = q 0 - q, regressors to a model may reduce the bias 

z i 2 

term SSB, but at the expense of increasing the variance term from q^a to 

q 2 <? 2 . If the equation- is needed for prediction, it may be better to drop a 

few regressors and accept some bias in exchange for a smaller A^ and a 

simpler equation. 
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It can be shown (ref. 24) that 


V 


RSS (n - p - 1) 


RSS 


P+1 


n + 2q 


and 



1 - 


1 


1 - 


R 2 

Vi 


- 1 
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4. ALTERNATIVES TO ORDINARY LEAST-SQUARES. ESTIMATION 


Some attention has recently been given to two aspects of regression analysis. 

The first aspect is the attempted improvement of point estimators where 
the criterion of goodness is the mean-square error {refs. 25-27). Sclove 
(ref. 27) discusses an estimation technique which guarantees that the sum 
of component-wise mean-square errors of the biased estimator is smaller 
than that of the ordinary unbiased least-squares estimator. He presents 
some further results under the restrictive condition that the independent 
variable of the model can be ordered in importance prior to analysis. These 
procedures can be somewhat difficult to implement, and very little is known 
about the distributional properties of the resulting estimators. 

The second aspect of regression considered recently is the problem of point 
estimation where there is a high degree of multi col linearity among the 
predictor variables (refs. 28-33). .Hoerl and Kennard (ref. 28) propose a 
class of biased estimators called ridge estimators; their criterion of 
goodness is mean-square error. The technique is relatively easy to use, 
and it may be shown that the class contains estimators which have smaller 
mean-square error than the least-squares estimator. However, they are 
not able to provide a well-defined and unique choice of estimators from 
this class, nor have they been able to prove that their suggested procedure 
actually chooses a member of the class which achieves smaller mean-square 
error. In fact, Newhouse and Oman (ref. 34) have reported some Monte Carlo 
simulation results which indicate that ridge estimators do' not in general 
perform better than least-squares estimators. 

LaMotte (ref. 35) presents some of the properties of best and Bayes linear 
estimators. The best estimator follows. 

Let L Q be an n x p matrix, then the linear estimator Lgy of B is best at (Og, Bg) 
if there exists a (og, ’Bg) such that for any L, 

TMSE^tog, B 0 ) < TMSE L (a Q , B Q ) 
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where 


TMSE L (a, 3) = E[(L'y - 3)'(L'y - 3)] 

='trace[a 2 L'L + (X'L -'I)'33'(X’L - I)] 

The Bayes linear estimator is a linear estimator with minimum average total 
mean-square error (TMSE), averaged over values of (a, 3)- Basically, one 
substitutes y = E[cr ]■ and <j> = E[33‘] in the above equation for the TMSE 
obtaining 

E[TMSE L (a, 3)| Y, <f>] = ETMSE L (y, <j>) 

= trace[yL'L + (X'L - I ) 1 <f>( X 1 L - I)] 

Marquardt (ref.- 36) introduces a class of biased estimators called generalized 
inverse estimators. Mayer and Willke (ref. 30) discuss a number of classes 
of biased estimators called shrunken estimators. These classes contain 
members with smaller mean-square error than the least-squares estimator. 

It is not known how to choose such members, however; and there is very 
little known about the distributional properties of these estimators. 

Kendall (ref. 31) and Massy (ref. 32). discuss principal components regression, 
which was not introduced as a method of biased estimation -but will be shown 
to provide biased estimators. The method is very closely related to 
Marquardt's and was introduced for -use when there is multicollinearity. 

In this chapter a method of unifying the treatment of these biased estimation 
methods and of unbiased least-squares estimation is considered. The 
presentation centers on ai duality' of the X'X matrix of the normal equations 
for unbiased least-squares estimation.. The duality is in the sense that 
the spectral decomposition of X'X. into its eigenspace representation has 
the property of describing how well the data points are spread out in the 
data space. A similar decomposition of (X'X) -1 (or a generalized inverse 
of X'X if it is singular) has the property of describing how the distribution 
of the estimator b is spread out in. the parameter space. We lean heavily on 
these decompositions to discuss the interrelationships of all these estimation 
methods and to describe the consequences of using them. 
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The near-singular case of X'X will be dealt with in this section. A measure 
of ill-conditioning of the X-matrix is its condition number C[X] which is 
defined as the ratio of the largest to the smallest nonzero singular value 
of X. The singular values of X are the positive square roots of the 
eigenvalues of X'X. 

A more' precise definition of ill-conditioning follows. A set of linear 
equations BX - c is said to be ill-conditioned if small errors or variations 
in the elements of B and c can have large effects on the exact solution 
X. For example, the difference dX between the solution of BX = c and that of 

(B + dB) (X + dX) » c + dq 

can be expressed as. • 

dX = (B + dB) -1 (dc - dBX) 

and its value depends critically on the inverse matrix. If B is near 

singular, that is, small changes in its elements can cause singularity, then 

dX could be very large. In the case of the normal least-squares equation, 

B = X' X and c = X'Y will contain roundoff errors because they must be 

computed from X and Y.‘ Even if B could be computed exactly, it would not 

necessarily be stored exactly in the computer; all numbers are stored in 

binary mode, and a decimal number such as 0.1 is a nonterminating binary 

fraction. If X is ill-conditioned, small changes in the elements of X can 

-1 -1 

cause large changes in (X'X) ; and if b = (X'X) X'Y, then any errors in 

the formation of X'X could have a serious effect on the stability and 
accuracy of the solution. As an illustration, Searle (ref. 37) discusses a 
model for the weights of six rubber plants, three of which are normal, two 
of which are off-type, and one of which is an aberrant. The data are 
presented in the following matrices. The model considered is 

y ij = ^ + b l + b 2 X 2 + b 3 X 3 + £ 
where X. = 1 if the plant is of the it/ztype; otherwise, X^ = 0. 



Let 



and it is seen that X 1 X is singular and of rank 3. A generalized inverse 
of X'X is 


G = (X'X)" 


0 0 0 

0 1/3 0 

0 - 0 1/2 

0 0 0 



and for this choice of G we have 


H = GX'X 


0 0 0 . 0 
110 0 
10 10 
10 0 1 
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Hence, all estimable functions are of the form 


w'Hb = (w-j + W2 + w^)y + w-jb-j + w-^ + w^b^ 

and there are, at most, three linearly .independent choices of w. The 
unbiased estimates of w'Hb are given by 


w ’ GX ' Y = w,*,. +w 2 y 2> +w 3 y 3 


Three reasonable choices for independent estimable functions are: 



and p + l/3(b^ + + bg) 


Their corresponding ' estimators are 

yu-yz. ‘ 14 
h. - y 3 . ■ 54 

and 1/3 ^1. + ^2. + ^3 = 72 2/3 

To consider what problems arise as the gap is slowly bridged from X'X 
nonsingular to singular, modify the previous example, barely removing it from 
singular setting, and perform a regression analysis. Performing an 
experiment to study the abrasion resistance of rubber as . function of the 
amount of three particular additives, let x^ denote the amount of pounds 
of the Hh additive which is loaded with an approximately 1000-pound 
charge to the chemical reactor which produces the rubber. The proposed 
model is 


y = X8 + e 


A 
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where 


1 0.99 0 0 

1 1.00 0 0 
x= 1 1.01 0 o 

1 0 0.99 0 

10 1.01 0 
JO 0 1_ 

y ‘ = [101 , 105, 94, 84, 88, 32] 


3 _ Ev*» » @ 2 ’ 3j] 


and e is the random error. In this example. 


X'X 


6 3 2 1 

3 3.0002 0 0 

2 0 2.0002 0 
10 0 1 


and it is evident that X'X is not singular, yet it is nearly so. Let J 
denote the eigenvalues of X'X. For the matrix X'X above 


'= 8.41888 
= 2.38695 
£ 3 - 1.19444 
& 4 = 0.00010 


Since = 0.0, X'X is nearly singular. The parameter estimates 


u = 168.002 
b ] = -68.0212 
b 2 = -81.9742 
b 3 = -136.002 
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obtained with a covariance 

matrix of 

tx'xrV, 

where 


‘ 2500.38 

-2500.21 

-2500.13 

-2500.38 

(X'X) -1 = 

-2500.21 

2500.38 

2499.96 

2500.21 

-2500.13 


2500.38 

2500.13 


-2500.38 



2501 .38 


It is evident that X 1 X is formally of rank four although it is essentially 
of rank three and that the resulting (X'X)" 1 matrix indicates a large variance 
in the parameter estimates. However, recall the estimable functions discussed 
in section 3.3, and compute 


b-j - b 2 = 13.9530 
b 2 - b 3 = 54.0278 
p+ l/3(b 1 + b 2 + b 3 ) = 72.6695 


( 10 ) 


Allowing for the fact that X was slightly changed to provide nonsingularity, 
the agreement is admirable. Although the variances of the raw estimates are 
quite large, consider the variances of the linear combinations of parameters 
in eq. (10). The linear combinations are defined by K'b where 


K = 


0 0 1 

1 0 1/3 

-1 1 1/3 

_ 0 -1 l/3_ 


The covariance matrix of K'b is given by K'(X'X) 


-1 



But 


K' (X'X) ] K = 


0.82 

-0.33 

0.05 

-0.33 

1.50 

0.17 

0.05 

-0.17 

0.53 


L 


It is thus evident that, even though the full parameter vector is quite ill- 
determined, the linear combinations of the parameters corresponding to the 
estimable functions of the previous example are well determined. 
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The normal equations matrix X'X plays the central role in linear model 
estimation and hypothesis testing. Note that X'X has a spectral decomposition 
(ref. 2) or representation as 

P 

X ' X = £Vi V i (11) 

i=l 

where > ••• > £ > 0 are the eigenvalues of X‘X and are corresponding 
normalized eigenvectors of X'X. If r is the rank of X'X, then a similar 
decomposition (ref. 2) of (X'X) is 


If r = p, then 


(X'X ); 1 = 



(X'X)‘ = (X'X) 


-1 


-Lit Vi 

i=i 1 


( 12 ) 


It is extremely important to note that the spectral -representations of 
eqs. (11) and (12) are not invariant under linear transformations of X. 
Invariance may be attained by assuming the linear model is always considered 
in its correlation form. In the remainder of this report, it will be assumed 
that all diagonal elements of X'X equal unity. The spectral decomposition of 
X'X by eq. (11) indicates how and how well the variables' space is spanned 
by the experiment. If &. = 1.0 for all i, then in a sense the variables' 
space is perfectly spanned. If £-j » 1 ^, then the variables' space is not 
well spanned. In fact, X Q V-j represents the linear subspace (or linear 
combination) of predictor variables which is best spanned, and XqV represents 
the linear combination of variables most poorly spanned. In fact, if = 0, 
then XgVp is- not spanned at all. These considerations are discussed by 
Kendall and Stuart (ref. 38). To illustrate the preceding paragraphs, 
consider the following two-dimensional example. Suppose the data points 
observed are as plotted on figure 3. 
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Assume that the line is the line X-j - = 0 in the variables' space and 

that Lg is the line X-j + Xg = 0. Assume also that the two extreme points 
along l_ 2 are equally distant from X-j - X 2 = 0. It is immediately seen that 
the observations are much more spread out along L-j than along L^. For such 
a situation, and XgV-j is well spanned, while XqV 2 is poorly spanned. 


Considering the parameter space, it is well known that the least-squares 
estimator b (under normal distribution theory) follows the normal distribution 
N[(X'X) + (X'X)b, a 2 (X'X)*]. For a linear combination of the estimates w'6 
(ref. 37): 


Var(w'b) = w'(X'X)*wa 2 


It can be shown, (ref. 39) that the. choice of w which minimizes the variance 
of w'b is w = S-j and that this variance is 


Var(V-jb) = Vj (X'XjYjO 2 



The choice of w which maximizes the variance of w'b is w = V and 

P 


■Var(V'b) = — (assuming p = r.) 

r 


Thus, V-jb describes the most determined linear combination of the parameters, 

while V'b describes the least determined. .In fact, if £ =0, then V'b is 
P P P 

nonestimable and hence not determined at all; an interpretation is that 

Vpb has infinite variance. 


4.1 RIDGE AND GENERALIZED -RIDGE 

One recently proposed alternative to least-squares estimation of the regression 
parameters which has received considerable attention in the statistical 
literature is ridge regression (refs. 28 and 29). When multicollinearities 
exist among the regressor variables, the least-squares estimates b^ tend to 
be large in magnitude-. This can result in b being, far removed from 8, in 
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terms of Euclidean distance, even though b is an unbiased estimator of g. 
If L-j denotes the distance from b to 3 (ref. 28) 




i=l 


where - •••< £ p are the eigenvalues of X 1 X as before. As pointed 

out in the previous section, the existence of multi coll ineari ties means that 
X'X will have one or more small eigenvalues, thus the distance from b to 3 
will generally be large. 


Reviewing some of the most important properties of the ridge estimator, recall 
that the best linear unbiased estimator of 8 is 

b f (X 1 X) — 1 X * Y - 

Then X'X may be represented as X’X = PIP', where P is the orthogonal matrix 
whose columns are the normalized eigenvectors of X'X and L is the diagonal 
matrix of eigenvalues. If one considers the transformation to new predictor 
variables defined by 


and the model 


W = XP 


then 


y = Wa + e 


a = 


P'b 


(13) 


■ W'W = L 
a' a =- b' b 


The generalized ridge estimation procedure, defined by the family of 

estimators indexed by the parameters >0, is 

» 

a* = (W'W + K) _1 W'y .(14) 
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where the matrix K is defined by 

K = diag{k. } - i = 1, 2, •••, p 

When all = 0, a* is the OLS estimator and is unbiased. When any 
k- > 0, the resulting estimator for a is biased, defining the mean- 
square error of a* as 


M ( K) = E[(a* - a) '(a* - a)] 

2 2 

It may be shown (ref. 28) that the choice of k. = a /a_j will minimize M(K) 

among the class of estimators defined by eq. (14). Unfortunately, in order 

2 2 

to utilize this optimal choice of , one must know- both a and a^. To 
circumvent this seemingly hopeless situation, one must resort to the following 
iterative procedure (ref. 40). 

1. Using OLS procedures on the canonical model, eq. (13), estimate the a.‘s 

J 

by computing 


a = (X'X) _1 Xy 


2 2 

and estimate a - by s . 


2. Use the value of s and the a.‘s from step 1 to compute 


k = - — 
- 2 

.. a j 


; := 1 . 2 . 


3. Use the k.'s to solve the expression 

J 

a* = (W'W + K) _1 W'y 

and thus obtain initial estimates of the at's. Next compute 

J 

P 

a*'a* = £ a f 

3=1 J 


Repeat steps 2 and 3 using the a*'s from step 3 and again compute 

J 


a* 'a*. 
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5. Continue .this iterative procedure and terminate when stability is achieved 
in a*‘a*. 

6. The generalized ridge regression coefficients are now given by 

b = P‘a* 

The above procedure has some intuitive appeal', but since the distributional 
properties of the resultant estimator are unknown, its validity as a statis- 
tical tool is subject to questioning. ' 

Simulations (refs. 41, 42, and 43) have shown that ridge estimators of the 

< 

form of eq. (14) do provide smaller mean-square errors than the OLS. In fact, 
reference 28 shows that if k* > 0 and if k^ = k* (i = 1, 2, ..., p) and 
hence 


a* = (W'W + k*I ) _1 X'y (15) 

r 

then there exists a k > 0 such that the mean-square error (M(k)) of a* is 
less than the mean-square error M(0) of the least-squares estimator b, where 

M(0) = E[ (3 - b)'(8 - b)] 

In practice, one must estimate k from the data. The properties of the 
estimator a*, when k is estimated from the data, are unknown. An optimal 
method for selecting a suitable value of k has been the center of much 
recent discussion in the statistical literature. Various methods have been 
proposed (refs. 28, 29, 36, 41, 44, 45, and others). An excellent discussion 
of ridge regression and the various methods for choosing k is contained in 
reference 46. 

Expressing' the ridge estimator eq. (15) as 

b(k> = (X. ! X + kI p rVy (16) 

where k > 0 is nonstochastic . 


1 


In the literature, eq. (16) is referred to as the ordinary ridge estimator. ' 



Obviously, if k = 0, eq. (16) is the OLS estimator; i.e., 

b(0) = (X * X)~ ] X 'y 
It has been shown (ref. 28) that 

P £ . 

M(k) = a 2 V 1 9 + k 2 b’(X‘X + kl)" 2 b 

U <‘i - M 2 

and- that M'(O) < 0. Observe the effect of k on the quantities V j b ( k ) . 
The equality 


Vlb(k) = V! (X'X + kl) Vy 

= v i(2iT J nrVj) x ^ 


i. + k "i 


v:x*y 


is immediately obtained; also 

E 


[q b < k )] = I-5T VIX'ECy] 




TT v i b 


and Var[V!b(k)l = j v i r } XV i 

L 1 J (£. + k) 6 1 

v 2 

" U-j + k ) 2 

Thus for any nonzero k, V-Jb(k) is the least biased linear combination of the 
estimator and V^b(k) is the most biased. Also, Vjb(k) has the least reduced 
variance, and V^b(k) has the most reduced variance. Thus, the best determined 
linear combinations of the parameter estimates are the least modified, while 
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the least determined are the most modified. In effect, as k increases, 
those Vjb(k) corresponding to small are rapidly driven to zero. 

Recall that the. predicted regression function at Xq is 

y 0 - X Q b(k) 

and the mean-square error of this predicted regression is denoted by 


M p [y 0 |b(k)] = E{[y Q - X Q b(l<)] * [y 0 - X Q b(k}]} 

= E{[X 0 (X'X + klJ^X'e + X Q (Z -• I)b]'[X 0 (X'X + kI) _1 X‘e 
+ X Q (Z - I)b]} . . . 

= ■E[e , X(X'X + kI)" 1 X 0 'X 0 (X'X + kI) _1 X'e] + b'(Z - OX^Z - I) 
= Y-|[b(k)] + t 2 E b ( k ) ] 

where y-|[b(k}] corresponds to the variance and [ b ( k ) ] corresponds to the 
bias squared. 

THEOREM 1: ‘ 

The variance function y-|[b(k)] is a monotonically decreasing function of k 
and y'[b(0)] < 0 (ref. 47). * 


PROOF: 

2 

Note that if we assume s ~ N(0, o I) then y-j is the expectation of a quadratic 
form in assumption e, thus (ref. 37) 

Y-j [b(k)] = c 2 tr[X(X’X + kI)' 1 X^X Q (X' X + kI)“ 1 X' J 

Note that 


(X'X + kl) 1 



1 

£ i + k 


V 


• V'. 
1 1 
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and hence 


which is n * 1 . 

Y 1 [b(k)] 


a 


X(X'X + kirV Eit^ x Wo 
Thus, 

i 

(Z itVtt x Wo)(E i^nr x o v j v d x )] 


= tr 


■ E E a + k k * k) tr [( xv j v j x o)( x o v i v i x ')] 

i j J 

" E E (x i + FHT + k)( x o v i v i x ’) ( xv i v i x o) 

• - * .1 


1 J 


= SEil 


1 J 


(£, + kR^ +-k) A 0 


M? 




z. 


Y, ? X n V,V’-X' 

Y + k) 2 0110 


Note that 

; Y-, [b(0)] “• o^X Q (X' X) -1 Xq 

Y-, [b( 0) ] = 0 
and 


X^[b(k)] = -a 2 


21 . 
1 1 


X rt V,V,XA 


e u. ♦ k ) 3 


< 0 


Thus, y-j’ is a monotonical ly decreasing function of k, as was to be shown. 



THEOREM 2: 


The bias function Y 2 [b(k)] satisfies y 2 P>(0)] = 0 and y£[b(0)] - 0. 


PROOF: 

Recall Y z [b(k)] = b 1 (Z - I)XqX q (Z - I)b. Since 

• Z - I = (X'X + kI) _1 X'X - I 


*i 


tJlj. v.'v i 

+ k ,ii 


we obtain 



b '¥iWi? 


Let 

k 2 

f ij (k) " (l f + k)(lj '+' k) 


then 


tijOO 


k 2 U, + l.) + 2kl.i, 

1 J LJ- > o 

(i>. + k) 2 (^ + k) 2 


Thus, it is easily seen that Y[b(0) ] = 0 and Y 2 [b(0)] - 0. 


One of the most important results of the ordinary ridge regression is the 
following theorem. 

THEOREM 3: 

Mp[y 0 |b(k)] is initially decreasing .in k. 

PROOF: 

Since M [y Q |b(k)] = Y-,[b(k)] + Y 2 Cb(k)] the result follows directly from 
theorems 1 and 2. 



Hoerl and Kennard (ref. 28). discuss a general Bayesian interpretation for the 
ridge estimator. Marquardt (ref. 36) gives a more specific relationship to 
Bayesian estimation, as follows. 

THEOREM 4: 

The ridge estimator is equivalent to a least-squares estimator when the 
actual data are supplemented by a fictitious set of data points taken according 
to an orthogonal experiment H^; the response y is set to zero for each of 
these supplementary data points. 

PROOF: 

Augmenting the X-matrix by H k , the least-squares normal equations become 


• . . . \ 

— — 

X 



_ y‘ 

: H k) 

• • • 

H k 

L J 

b = (X 1 

• H k)| 

• • • 

_ o_ 


or 

(X'X + H‘H k )b = X'y (17) 

Since is orthogonal, is a scalar multiple of Ip ; for any value k, the 
matrix may always be scaled such that = kip, and eq. (17) is identical 
to eq. (16). 

To illustrate, possible choices for are (a) = k^I , or (b) = 2? 

factorial experiment with the variables at levels -a and +a , where 
a = (k2 - P)^ 2 . This theorem illustrates from another viewpoint the 
mechanism by which the regression coefficients' are -damped by the ridge 
estimator. The estimator is seen to be a type of weighted average between 
the actual data and other data (in Bayesian terms, the prior information) for 
which the response values are arbitrarily set to zero. (For nonstandardized 
variables, the response values for the fictitious data would be set equal to 
the mean response to the actual data if the model y = X6 + e contains a 
constant term.) 



An excellent paper by Obenchain (ref. 48) develops the theoretical foundations 
to hypothesis testing and confidence regions for ordinary and generalized 
ridge regression estimators. He shows that any ridge estimator with strictly 
positive and nonstochastic "shrinkage factors" , k 2 , k p yields the 
same exact F (or t) statistic for the test of any linear hypothesis as does 
least squares. It follows that the unbiased confidence region based upon 
the .F (or t) distribution corresponding to any such ridge estimator is 
identical to the least-squares region of the same confidence. 


4.2 NIARQUARDT'S GENERALIZED INVERSE ESTIMATOR 

Marquardt (ref. 36) discusses a method of applying generalized inverses to 
biased estimation. He also considers some relations among these estimators, 
ridge estimators, and nonlinear estimation. He .considers the model . . 

y = X$ +-e * (18) 

where the X-matrix has been scaled so that X'X is in the correlation form. 

His family of estimators is indexed by a parameter h where 0 < h < p. The 
family is defined by 


b m (h)' = (X’XjJx'V 

The matrix (X'X)^ is defined as follows: let h* = [h] denote the greatest 

integer in h and dh = h - h*. Then (X'X)j^ is defined as 


‘(X'X)J 


h* 

= E 

d=l 


tv;* 


dh 

V+l 


V h*+T V '‘ h*+T 


. (19) 


= G h 

As the notation is meant to indicate, (X'X)^ is closely related to a gener- 
alized inverse of X’X. In fact, if r = rank (X‘X)', the ( X 1 X ) ^ is the Moore- 
Penrose pseudoinverse of X’X and is unique (ref. 49). An, important point 
to note is that the Moore-Penrose pseudoinverse yields the minimum-norm 
solution to the normal equations (ref. 50). 
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Marquardt's estimators thus provide a sort of minimum norm solution to. the 
normal equations. He also shows that there always exists a 0 < h < p such 
that 

M(h) = E j[b^(h) - b]‘[b m (h) - b][ 

is minimized. It is also shown that M’(p) > 0 so that the mean-square error 
of b m (h) is initially decreasing as h decreases from p. As with the ridge 
estimators, no way is yet developed for determining the "best" h. 


With X scaled so that X'X is in correlation form, Marquardt labels the 

4. 

diagonal elements of (X'X)^ as variance inflation factors. His suggested 
analytical procedure is to consider several estimates b m ( h ) for h between 
p and 0. He suggests the rule of thumb that an acceptable value of h.is one 
such that the maximum variance inflation factor should usually be larger 
than 1.0 but certainly not as large as 10.0. Marquardt has not been able 
to show that this procedure results in a reduction in M(h). 


For these estimators, 


0 


h < i - 1 


Vlb m (h) -=-Vl (X'X^X'y = ^ViX'y i - 1 < h < 1 


( 20 ) 


E [ V i b m(h)] =<dhVlb 



( 21 ) 


i < 1 


4-20- ■ 



and 


/ 



h < i - 1 


i - 1 < h < i 


i < h 


( 22 ) 


From eqs. (21) and (22), the following behavior is seen as h decreases from 
h = p: the V ib (h) are successively set to zero in order of increasing l . . 
The best determined linear combinations of the parameter estimates are the 
last to be set to zero, while the least determined are the first to be set 
to zero. 


THEOREM 5: 

The estimate b m (h) is a linear transform of b, and the transform depends only 
on X and h. 


PROOF: 

Let A = X'X, then 
and 


S'AS = L 


A + = SL'V' 


Suppose A is of rank r, so that the last (p - r) ordered elements of L are 
zero (or nearly so, if A is only "nearly singular"). Partition S as follows: 

s = (s : s ) 

v r p-r ; 

where S r is [P x r]; Sp_ r is [p x (p - r)]. Partition L similarly 


L = 


* 0 
r • 


0 : L_ 


'p-rJ 
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is [(p - r ) x (p - r )]. 


where L is [r x pi; L 

r L J p-r 

Now, by supposition, L is zero, so that L“| = 0 by definition. Thus. 

■ r 

the psuedoinverse becomes 

A + = S L S’ 
r r r r . 


Therefore 


but 


and thus 


b ( h ) = SL'Js’X'y 

m r r r J 


X'y = (X'X)b 


b m (h) = s r L;‘s;{x’x)b 


Z b 
r 


It follows immediately that b m (h) is a biased estimator of 3, if L p r is a 
nonnull matrix. If L p _ r is precisely a null matrix, b m (h) is conditionally 
unbiased relative to the constraints E[b m (h)] = Z p b implied by the columns 


of S 


P-r* 


THEOREM 6: 


The variance of b (h) is 
m 


Var[b m (h)] 


° 2 [ S r L r ls rJ < X ' X > [ S r L r' S r] 


PROOF: 


Var(b) = a 2 (X’X)" 1 
Var(Z r b) = a 2 Z r (X’X) _1 z; 


thus , 



Substituting 2 , the result is immediate. • It can also be shown that 

Var[b m (h)] = o^L^s; 


THEOREM 7: 

The mean-square error of b(h) is 

E[L?] - tr[Var{b m (h))] + 3'(Z r - l)HZ r - 1)6 
This may be proved in the same manner as in ridge regression (section 4.1). 

2-i 

The second term on the right side of E[L-|] is the square of the bias; it will 
be zero when r = p. 


COROLLARY 1: 

2 

The variance term in E[L-j] is an increasing function of r. 


PROOF: 

Employing eq. (19), we have 


Hence, 


s r L r s r 2 a, V j V j 

■ j=l • 0 


tr [ S r L r ls ;]= Er tr[V j V j ] 
j=l 0 


But tr[V.V'] = IjV.Ji =1.0, since S is an orthonormal rotation. 
J J , J * 

tr C s r L ' ls 4 • i h 

j=l J 


Thus. 



Since Z. > 0 for all j, eq. (23) increases monotonical ly with r. In the special 
3 

case where the data are orthogonal, i.e., all Z- - 1.0, and where r = p, this 

2 J 

result becomes pa , as expected. 
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COROLLARY 2: 


The bias term in E[L^] is a monotonic decreasing function of r. 


PROOF: 

The bias term is 


(Bias) 2 = 3’(Z r - I) 1 (Z r - 1)3 


= 3 , [S r L' 1 S r (X'X) - l]'[s r L" 1 S r (X*X) - l]B 


Partitioning the several matrices in (Z^ - I), and simplifying, results in 

(Z-I)-SS'-I 
v r ' r r p 


and therefore. 


= -S S' 
p-r p-r 


(Bias) 2 - B'[S p „ r S'. r j'|-S p . r S^ r ]6 
^ S p-r S p-r S p-r S p-r^ 


and therefore. 


S p-r S p-r l p-r 
(Bias) 2 = B'S p _ r S^_ r l3 


Now Tp_ r = Sp_ r 3 is the (p - r)-element vector of projections of 3 onto the 
subspace spanned by Sp_ r - In this notation 

(8i«) 2 * -W 


■s 


i=r+l 


Since t. does not depend on r, we have the result that (Bias)^ is a monotonic 

J 2 

decreasing function of r. Furthermore, (Bias) has the limiting value 0.0 

for r = p and the value 3'3 for r = 0, since TpTp = 3'3. 
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THEOREM 8: 


A sufficient condition for the mean-square error E[L^] to be less than the 
least-squares variance is 

P 

Z)jrr > T (B ' B ) (i 

j=r+l J 0 


PROOF: 


; ing eqs. (23) and (24), E[L!p is given by 


^-■lE* + E 


j=r+l 


while the least-squares variance is 


2 ^ 1 


Var(b) = 


Thus, a- neccesary and sufficient condition for 


:[l 2 < Var(b) 


is that 


E A>E 


j=r+l J / j=r+l 


j=r+l 


Eft 


Thus, a sufficient condition is that 


2 9 

J- > ti ; j > r 
J J 
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Since eq. (26) would be difficult to apply in practice, a less stringent 
but more useful sufficient condition can be obtained by noting that 


P P 



j=l j=r+l 


for any r < p. Thus, the inequality can be written as in eq. (25). This 
result corresponds to the following theorem. 


THEOREM 9: 

If 6'£ is bounded, then there exists a k > 0 such that the mean-square error 
of the ordinary ridge regression b(k) is less than the mean-square error of 
the least-squares estimator (ref. 28). An important theorem due to Marquardt 
(ref. 36) with some interesting, interpretation follows. 

THEOREM 10: 

Let Y r be the angle between ‘b m (r) and g = X'y. Then > y -j (r an 
integer) whenever and satisfy the inequalities (0 < £ ), (£ /£„ -| « 1)» 
and (£ -,/£ ? « 1). Since g is independent of b (r), it follows that b (r) 
rotates toward g as r is decreased under these conditions. 


4.3 SHRUNKEN ESTIMATORS 

Mayer and Will ke (ref. -30) discuss several families of biased estimators which 
may be labeled shrunken estimators. They consider the model y' = X£3 + e, but 
do not require that X'X be in correlation form. Each family of estimators is 
indexed by a parameter c, 0 <‘c < 1 -and defined by 

b $ (c) = c(X'X)~^X‘y = cb 

where b is the ordinary unbiased least-squares estimator. If the constant c 
is a scalar fixed in advance' of the analysis, then £> s (c) is called a 
deterministically shrunken estimator. If c is a scalar function of the 
least-squares estimator, then b s (c) is called a stochastically shrunken 
estimator. 
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Hoerl and Kennard (ref. 28) justify the use of the ridge estimator in non- 
orthogonal problems in two ways: (1) They show that, for a fixed k, b(k) 

corresponds to the point on a fixed ellipse of concentration of b which has 
minimum Euclidean length and (2) they show that in any given problem the 
class of ridge estimators satisfy the following admissibility condition: A 

class of estimators E will be called (mean square) admissible "if for every 
problem there is an e in E such that M(e) < M(b) = Var(b). 

Although the shrunken estimator b s (c), with shrinkage factor c, may seem a 
rather simplistic alteration of b, the following proposition proved in 
reference 30 shows that these satisfy the admissibility condition presented 
above. 

PROPOSITION 1:. For every B. there exists a fixed c in [0, 1] such that 

M[b s ( c) ] < M(b) and thus the subclass of deterministically shrunken estimators 

is admissible. 

Consider the stochastically shrunken estimator b s (c)., where 

c = [1 - qS^(b'b) 

s 2 .= y'y - b 1 (x'x)"^b ; p > 3 

and 


0 < q < 2(p - 2)(n - p + 2) ^ 

This estimator is one discussed by Sclove (ref. 27). Defining 
W[b s (c)] = Ej[b s (c) - b]'[b s ('c) - bij- 
it was shown by Sclove that 

P - 2 

^ ^0 n - p + 2 

minimizes W[b s ( c ) ] . This is the only biased estimator known to this author 
for which a choice of biased estimator can be explicitly given which guarantees 
a reduction in mean-square error. 



Let C denote the class of linear transforms of b, and let t = Ab for some 
p x p matrix A. Note that if we let t(A) = Ab for fixed A, then 

E[t(A)] = AB 
Var[t(A)] = a 2 A'S _1 A 

M[t(A)] = a 2 trA'S _1 A + B‘(A - I)' (A - I)B 
and the sum-of-squares loss associated with t(A) is 
L(A) = [y - Xt(A)]'[y - Xt(A)] 

= (y - Xb) ' (y - Xb) + b‘(A - I)'S(A - I)b 
= L(b) + L*(A) 

Since L*(I). = 0, L(A) is minimized by letting A = I, which yields .the least- 
squares estimator. However, if L*(A) > 0, then' the mapping from the space 
of p x p matrices to the real line defined by y (A) = L*(A) maps an entire 
class of matrices to the same value. The preimage of any fixed constant 
rQ consists of all p x p matrices, satisfying 

b' (A - I ) ’ S (A - I)b = r Q 

Let C(rg) denote the subclass of C such that t(Ag) is in C(rg) if and only 
if L*(Aq) = rg. C(rg} is actually an equivalence class, the equivalence 
being defined with respect to the ’sum-of-squares loss function. It can be 
shown that both ridge estimators and the deterministically shrunken estimators 
can be characterized as minimum normal estimators in the class C (ref. 30).. 
Suppose the criterion for selecting an estimator from an equivalence 
class is to choose the estimator which' has minimum -Euclidean length (normal). 
Let ' 

’ m(A) = t 1 (A)t(A) = b'A’Ab 

denote the squared Euclidean length of t(A). Mayer and Will ke (ref. 30) 
have proved two propositions that link the ordinary ridge estimators and the 
shrunken estimators. 
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PROPOSITION 2: . If A g = (kS + I)" 1 for some k and t(A g ) is’ in C(r g ), then 

m(A n ) = min m(A) 

0 C <>0> 

This proposition states that within its equivalence class the ridge estimator 
is the shortest. estimator, provided m(A) is the norm used to measure length. 
Now consider the design dependent norm 

m d (A) = t' (A)St(A) = b'A'SAb 

and suppose the optimal estimator in an equivalence class is defined to be 
the estimator with minimum length as measured by m d (A). 

PROPOSITION 3: If A ] = cl for some c in [0, 1] and b s (c) belongs to C(r g ), 

then 

m .(cl) = min m .(A) 
d C(r 0 ) d 

Since t(A^) = cb = b s (c) we have shown that both ridge estimators and the 
shrunken' estimators are minimum length estimators with respect' to the 
appropriate norms: 

For the choice of deterministically shrunken estimator b $ (c) 

V!b s (c) =,cV!b.= ~V!X‘Y 

E[V!b s (c)] = cV’b 
and 

Var[V!b s (c)] = £j£- 


¥ 


= cb we have 


(27) 


(28) 
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From eqs. (27) and (28) we observe that, unlike the ridge and generalized 
inverse estimators, all linear combinations of the parameter estimates. are 
driven toward zero proportionately and that the variances are also propor- 
tionately reduced. 

The mean-square error of the estimated regression function is given (ref. 47) 
by 

M p[ y 0 |b s (c)] = E[(y Q = X Q b)'(y 0 - X Q b)] 

- e{[(c - l)X Q b + cX 0 (X‘X)' 1 X l e]' [(c - 1 )X 0 b 
+ cX 0 (X'X)~ 1 X , e]| 

= -E[c 2 e;x(X , X)~ 1 X^X 0 (X , X)" 1 X'eJ 
+ 2E j^(c - l)cb'X ( i ) X 0 (X , X)~ 1 X , e] 

+ E[(c - l) 2 b'XJX 0 bj 

For stochastically shrunken estimators, these expectations may be somewhat 
difficult. For a deterministically shrunken estimator, c is a constant and 
Mp[b $ (c)] is easily found to be 

M p [y 0 ! b s ( c )] = c 2 E[e , X(X , X)" 1 X^ ) X 0 (X'X)” 1 X t e] 

+ (c - 1 ) 2 b'X^X^b 
= Y][b s (c)] + Y 2 Cb s (c)] 

The following three theorems are due to Sidik (ref. 47). 

THEOREM 11: 

The variance function Y-|[b s (c)] is a monotonically increasing function of 
c > 0 and Y-j[b s (l)] >0. 
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THEOREM 12: 


The bias function ¥ 2 ^( 0 )] is a monotonically decreasing function of c for 
0 < c < 1. 

THEOREM 13: 

MpEy Q |b s (c)] is. initially decreasing as c decreases from c = 1, and there is 
a unique minimum for some 0 < c < 1. 

Theorem 13 states that an optimal choice of c exists. However, this optimal 

2 

value of c will be a function of a and b. 

Several authors (refs. 51 and 52) have considered different ways of unifying 
the study of biased estimators in an effort to determine their relative 
merits. Obenchain (ref. 52) has considered .the problem of testing whether 
ridge analysis may be useful. He defines the shrunken statistic that is used 
to decide if ridge analysis should be used or not. 

4.4 PRINCIPAL COMPONENTS REGRESSION 

A particular type of Marquardt's generalized inverse estimator is the 
principal components estimator, which involves an orthogonal reparameterization 
of the values of the regressor variables through the following procedure. 

Let S be the orthogonal matrix whose columns are the eigenvectors of X 1 X 

and let L be a diagonal matrix whose diagonal elements are the eigenvalues 

of X'X. If we also let Z = XS, then the jth column of Z, z., is called the 

. J 

j th principal component of X for j = 1, 2, p. 

The response variable .is now regressed on the principal components z., 

J 

rather than on the original variables x.. In place of the usual regression 

J « 

model 

y " ,X3 + e 
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now we have 


where 


y = Zy + e 


Y = S 1 3 

Using least squares, we obtain 

g = (Z 'lyh'y = L _1 Z’y (29) 

If all components are retained in the model, the estimates of the regression 
coefficients when transformed from g back to b through b = Sg will be identical 
to the least-squares estimates. 


Use of the procedures discussed above would hardly be necessary when the 
beta vector could be estimated directly by classical methods. At least two 
situations arise, however, in which ordinary least-squares is not appropriate 
(ref. 32): (1) when the independent variables are collinear with one another, 

making inversion of the correlation matrix impossible and the elements of 
beta indeterminate; and (2) when, because of high (but not complete) 
col linearity or for some other reason, it is desirable to collapse the 
independent variable space by deleting one or more principal components from 
the regression relationship. We are mostly concerned with the second case. 

To overcome the effects of mu Iti colli near ity on the least-squares estimates, 
the procedure in principal components regression is to delete from the 
analysis those components corresponding to small eigenvalues of X'X. The 
regression analysis is then performed using least squares on the remaining 
components. If s (1 < s < p) components are deleted, we can partition 

s = [S t : S s ], y‘ = CyJ. : y^] 
and 

L - [L t • L s ] 
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(30) 


where t = p - s. From eq. (29) we have 

z t = L t S t X ' y 

or, in terms of estimates of the original coefficients, 

V = s t L t ls t x ’ y 

By inserting (X f X)(X'X)^ into eq. (30) we have 


b = S.L^X'Xb 
pc t t 

Marquardt (ref. 36) has shown that mean-square error (MSE)(bp C ) < MSE(b.) 
if and only if 


P 



j=t+l 


VVs 6 

O 


so that, as with b(k) and b s (c), there is potential for improvement in MSE 
when compared with the least-squares estimator. 


A major problem with the use of principal components regression is deciding 
which components to delete. Two criteria are usually considered: 

a. Delete components associated with- small eigenvalues 

b. Delete components which, are relatively unimportant as predictors of 
the response variable y. 

Mansfield (ref. 53) has shown that the F-statistic used for measuring the 
predictiveness of a component associated with a small eigenvalue is 
unreliable and can lead to poor results. Mansfield recommends deleting all 
components associated with small eigenvalues, and he also provides a method 
of variable selection following principal components regression. 


Marquardt’ (ref . 36) points out the assumption of an integral number of 
zero eigenvalues of X.'X may be overl'y restrictive (see section 4.2). He 


4-33 



notes that in the case where X'X is actually of rank t, eq. (30) is the 
Moore-Penrose generalized inverse solution to the normal equations. In the 
case where X'X has full rank p but has several small eigenvalues, Marquardt 
suggested the concept of fractional rank of X, that is, we assume X to have 
rank f where t < f < t + 1 and use the generalized inverse 

<* ,x ) + "s t L;VI^St + 1 s i+1 

The principal components estimators depend upon the particular method used 
for determining the significance of the coefficients. The MSE of the 
predicted regression function is not considered in this paper. Two different 
procedures for subset regression in the principal components case were 
considered in references 54 and 55. 

The method of principal components regression is further discussed in' 
references 31, 32, 56, and' 57. 

c 

4.5 LATENT ROOT REGRESSION 

One of the most important issues (ref. 32) in principal components regression 
is the criteria to be used in choosing a subset. There are at least two 
alternative criteria for deleting components-: 

a. Delete the components that are relatively unimportant as predictors 
of the original independent variables in the problem; i.e. , the 
components having the smallest eigenvalues should be dropped. 

b. Delete the components that are relatively unimportant as predictors of 
the dependent variable y in the problem. In this case, the components 
having the smallest values of' the correlation between the components and 
y should be dropped. 

Hotelling (ref. 58) has noted that in general there is no reason why 
components that are important as far as the independent variables of a 
problem are concerned will be highly correlated with the dependent 
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variable in a regression, so criteria a and b above are likely to lead to 
different results. Furthermore, it is easily shown that y need not be 
highly correlated with components having large eigenvalues in order for the 
explanatory power of the complete principal component regression to be high. 

The choice of criteria must rest with the purpose of the analysis, as well 
as the degree to which the principal components results can be interpreted 
in terms of the structure of the process underlying the data for the 
independent variables. If the first few principal components can be related 
to something "real, 11 as is hopefully the case in factor analysis, for example 
then it may make sense to retain them as explanatory variables in a principal 
components-regression analysis, regardless of their correlation with the 
dependent variable. Massy (ref. 32) claims that components with large 
eigenvalues are usually the ones most likely to yield natural interpretations 
Conversely, if the emphasis is on finding the correlates of y rather than 
testing its relation to any particular structural concepts, it would seem 
to make more sense to adopt criterion b and retain those components with 
the highest values of the correlation coefficients between the components 
and the vector y. This is often the case in purely exploratory studies. 

Latent root regression is a procedure for implementing principal components 
regression by using criterion b above; this analysis was first suggested by 
Massy (ref. 32) and developed independently by Hawkins (ref. 59) and 
Webster, Gurst, and Mason (ref. 60). It is a modified least-squares 
procedure which uses the eigenvalues (latent roots) and eigenvector (latent 
vector) of the correlation matrix of response and regressor variables. 

Analysis of these eigenvalues and vectors will enable the experimenter to 

a. Identify multicollinearities among the regressor variables 

b. Determine whether the multicollinearities have value in predicting the 
response variable 



c. Obtain modified’] east-squares estimates of the regression coefficients 
through a procedure which adjusts for nonpredictive multicoil inearities. 

A stepwise backward elimination of variables was developed (>ref. 60), using 
ordinary least squares or the modified procedure. Using the model y = X6 + e 
where the vector y and the matrix X have been standardized, let 


t 2 -± (y, - y) 2 

i=l 

and define the matrix A = [y I X]; i.e., the (n x p + 1) matrix of standardized 
dependent and independent variables. A'A is the extended correlation of 
dependent and independent variables and has eigenvalues and eigenvectors 
defined by | A 1 A - = 0 and (A'A’ - A ^ I ) V J = 0; j = 0, 1, p. • Denote 

the elements of the jth eigenvector by 


v j ' (S 0 J» 


“lj, 


V 


and let 


Also let 


V 'j {S lj 5 S 2j ’ **'’ S pj } 


S (Vg, V-| , * ’ * , Vp) 


and 


L = diagUj) ; j = 0, 1, 2, p 

where c < ••• < £ . Hence 
u - 1 - p 

S' (A'A)S = L 
and 


A’A = SIS' 
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Nov; note that the j.th column of AS can be expressed as 


P 

y l S 0j + 2 x lk S kj 
k=l 

P 

^2 S 0j + S x 2k S kj 

AV. = k=1 

J 


P 

y n S 0j + 2} x nk S kj 
k=l 

We also note that the j th eigenvalue of A' A can be expressed as 

= Vj (A'AJVj- = (AVj) 1 (AV j ) 
n p ’ 

= 2 ( y i s 0j + 2 x ik S kj^ 2 ( 31 ) 

1=1 k=l 

Thus i. is the sum of squares of the jth set of linear combinations of 
J 

response and regressor variables which is provided by the jth column of AS. 


If l. =0 for any j = 0, 1, •••, p then each term in eq. (31) is equal to 
J 

zero and an exact linear relationship exists among some or all of the 
columns of A. If the corresponding f 0, a perfect predictor exists of 
the form 

f. 

P 

y i y ”- tS 0j x ik S kj 
k=l 

If = 0 and = 0, we see from eq. (31) that an exact linear dependence 
(exact multi col linearity) exists among the columns of X, the relationship 
being 
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p 

£ x ik s kj = 0 ; 1 - '• n 

k=l 

In general, none of the eigenvalues will be zero, but some may be quite 
small. Small but nonzero eigenvalues indicate near singularities. Notice 
from eq. (31) that if we have ^ 0 then each term in the sum must be near 

J 

zero and we will have 


S 0j y i 


E 

k=l 


S kj x ik 



(32) 


If in addition Sgj ~ 0, we have a multicollinearity involving only the 
regressor variables and not the response variable, the relationship being 


£ Vik 


k=l 


~ 0 


1 = 1 , 2 , 


n 


Since this relationship does not involve the response variable., it would 
be of little value for prediction. 


Let us see a geometrical interpretation of a nonpred.i'ctive multicollinearity. 
Consider the n data points (y^ , , x^» **■> x^ ) i = 1, 2,, •••, n as n 

points in the p +1 dimensional Euclidean space defined by the mutually 
orthogonal axes Y, X-j , *■*, Xp. The eigenvectors of A 1 A define a second 
set of mutually orthogonal axes Zg, Z-j , Z^, where Z^ is the axis defined 

by V. , i - 0, 1, p. The direction of axis Z. relative to the original 

' J 

axes is given by the vector sum 

P 

£ Vk ; J = °- '• •••’ p 

k=0 

where eg, e^ , •*•, e p are unit length vectors from the origin in the direction 
axes Y, X-,, X ? , ••*, X . The first element of V. represents the cosine of 
the angle between axes Y and ly whi ,e s kj (k = .1, 2, v, p) represents the 
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cosines of the angle between axes X. and Z.. Assuming the eigenvectors are 

J 0 

normalized and the eigenvalues are distinct, V. is uniquely determined apart 

J 

from a multiple of -1. 

The eigenvalue corresponding to a particular eigenvector measures the 

spread of the n data points in the direction defined'by the eigenvector. 

In other words, Z. is the sum of squares of the projections of the n data 
J 

points on the Z. axis. A small value of indicates that there is' little 
J 3 

variability in the Z. direction, i.e., 

j 

p 

2 ij = y i s oi + £ x ik s kj 
k=l 

is near zero for i = 1, 2; • •• , n. If Sqj is near zero, -the axis is 
nearly orthogonal to the Y-axis.* Hence, if both and S Q j are small, the 
eigenvector V. reveals a nonpredictive near singularity; a strong linear 
dependence only among the independent variables which produces little or 
no change in the dependent variable. The situation where £- is small but 

. > J 

S 0j - is not small, so that the response variable is involved in the * 
relationship, is termed predictive mul ticollinearity. The ability to detect 
the presence of predictive and nonpredictiye multicol linearity and' to 
determine the nature of the relationships through eq. (32) is one of the 
key features of the latent root regression procedure. This, feature is not 
shared by any of the other procedures outlined in the previous sections. 

The least-squares estimator is a linear combination of all p + 1 eigenvectors, 
including eigenvectors corresponding to nonpredictive near singularities. 

The modified least squares (ref. 60) utilize only linear .combinations 
of the eigenvectors not having both £. and S^. small. In this fashion, 
the estimates of the regression coefficients are adjusted for the effect 
of nonpredictive near singularities. 

Figures 4 and 5 illustrate, for three dimensions and four data points, the 
cases of predictive and nonpredictive mul ticollinearity. Nonpredictive 
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multicoll inearity characterized by a small £q and small |Sqq| is shown in 
figure 4. The case in which is small but j S 0Q | is large is illustrated 
by figure 5. 

Hawkins (ref. 59) arrives at similar conclusions but from a different point 
of view. Let (y^ , x^-j, — , x^) be the i th data point on the p + 1 
dimensional space spanned by y, x-j , *" s x p anc * ^ et 

y - b-jX-j - b 2 x 2 - - bpXp = 0 (33) 

be a fitted hyperplane. Hawkins considers measuring deviations of the n 
data points from the hyperplane (eq. (33)) in the direction of the normal 
to the hyperplane rather than in the Y-direction. 

If we let 

A = mean-squared deviation between fitted and observed responses in 
the direction of the normal line 

(A,.-D 2 

1=1 

where A- is the deviation of the i th data point in the direction normal to 
the fitted plane, and 
2 

d = mean-squared deviation between fitted and observed responses in 
' the direction of the Y-axis 

d2 - IT, <n - (d i - d > 2 

i=l 1=1 

v/e then have 

A = (cos 2 e)d 2 (34) 

where 0 is the angle between the normal to the hyperplane and the Y-axis. 

2 

Hawkins calls A the vertical norm and d the Y-norm (see fig. 6). He proposes 
A as an alternate measure of the fit of the hyperplane (eq. (33)). 




From eq. (34) we see that 
2 

a. If d is small, indicating the hyperplane provides a good fit to the 
given points, then the vertical norm A is also small. 

2 

b. If A is small, d is not necessarily small as cos 8 may be small. This 
would correspond to a multicol linearity among the regressor variables. 

2 

Hawkins also notes that, for a hyperplane chosen to minimize d , the vertical 

i 

norm will be equal to £q, the smallest eigenvalue of A' A, and will be in 
the direction defined by the corresponding eigenvector, Vg. Thus we see 
that the Zq axis (ref. 60) will be in the direction normal to the fitted 
hyperplane and that cos 0 = Sgg, where Sgg is the first element of Vg. 

Thus the nonpredictive multi coll inearity characterized by a small Zq and a 
small S Qg is the same as characterized (ref. 59) by a small vertical norm 
by a large Y-norm. This is illustrated in figure 6i If a second vector is 
chosen so as to be orthogonal to Vq and to minimize the vertical norm, that 
vector will be and the vertical norm is now , the second smallest 
eigenvalue of A' A. 

Now consider the problem of estimation. If all Sg^ f 0, then eq. (31) will 
provide p + 1 prediction equations of the form 

y j = 7]_ - ts'lxv9 ; j = 0, 1, •••, p (35) 

where 1_ is an m x l vector of l's. 

Normally, none of the individual equations in eq. (35) will by itself be a 
good predictor. Linear combinations of these predictors, therefore, will be 
used to obtain estimates of the parameters of the model. Consider the • 
following arbitrary linear combination of the predictors (eq. 35)): 

* = z Vo/ 

j=0 




4-44 



Imposing the restriction 


yields 


P 


D a j s oj 

j=0 


= 1 




The residual sum of squares using this predictor is 


(y - y)'(y - J) = t 2 a'L a = t 2 £ fa 

J-o 

where a 1 = (a Q , a^ , •••, a p ). 


(36) 


If a is now chosen to minimize the residual sum of squares subject to the 
above restriction, eq. (32) will yield the least-squares estimator. Thus 
we wish to minimize 


P(a) - t 2 V a 2 *. - 3 .s oi - l) 

j=0 \j=0 


(37) 


where -2 q is a Lagrangian multiplier. The solution is (ref. 60) 

,-l 


p \-i 

a 3 = Vj' I ; 

k=0 


(38) 


where H = A./sf:,. 

J J J 


From eqs. (36) and (-38) the least-squares estimator of the regression 
coefficients is then given by 
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b = -t E *k 


-1 


-1 P 


,-l„0 


>o 


y s n .at 1 vi 

^ Oj 0 J 

j=o 


(39) 


" -t E ^ 

j=0 

with residual sum of squares error (SSE) 

SSE - A Z 
\ k=0 

Suppose now that eigenvectors V Q , V-j , V R-1 correspond to nonpredicti ve 
near singularities. An obvious modification of the above procedure is to 
take a linear combination of the predictors in eq. (35) except for those 
which correspond to nonpredictive mu Iti coll ineari ties . We should then 
expect to obtain improved estimates of the regression coefficients without 
losing very much of the ability to predict the response variable y. 


The above least-squares estimator can be adjusted by setting a Q 

7) 

-1 


a-, = 


= a^ -j = 0. Then minimizing eq. (37) yields 


a j = S 0j j! 'j 1 ( E £ r _1 1 ; 3 = k, k + 1, •••, p 

.r=k 

so that the modified least-squares coefficients are 


r=k 


with residual sum of squares 


sse lr = nE 


j-k 


-1 


-1 


r=k 


(40) 
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Note that if X is not of full column rank, i.e., X'X is singular, this same 
procedure can be applied and minimization of eq. (37) will yield results 
identical with eq. (40). This follows from the fact that a singular matrix 
X'X implies some of the £. and corresponding of A'A will be zero- 
equivalent to setting the appropriate a. in eq. (37) to zero. Hence 

J 

solutions to the normal equations can be obtained from this procedure 
regardless of whether X is of full column rank. 

The estimates obtained from eqs. (39) and (40) are often strikingly different 
when X is near singular. One reason for this is that the a. corresponding 

J 

to eigenvectors revealing nonpredictive near singularities are often large 
relative to the remaining a^. When this occurs, the terms a^-Vj, o = 0, 1, 

•**, k - 1 can dominate b. Removing these dominating terms will then yield 
more accurate estimates of the true parameters 3. The latent root estimator 
is then a linear combination of vectors essentially orthogonal to the subspace 
defined by the nonpredictive multicollinearities and hence may yield more 
accurate estimates, depending on the orientation of 3 relative to that 
subspace. (See refs. 33 and 60 for examples.) 

In this example, with n = 12 and p = 6, = 0.001 and = 0.0287 while 

the remaining eigenvalues are larger than 0.3. The corresponding Sqj are 
0.0339 and 0.6987 so that Vq indicates a nonpredictive multi col linearity 
while V-| does not. Since £-j is small and S^ is the largest Sq^, V-j 
provides more information about the underlying model than any of the other 
eigenvectors. In this example, the latent root estimator is formed by 
removing Vq from the analysis. The aj for least squares (LS) and for 
latent root (LR) are then: 



fo 


a^ 

®3 

V 

f5 

!6 

LS 

1.760 ' 

1.317 

0.012 

0.002 

0.017 

0.015 

0.004 

LR 

0 

1.404 

0.013 

0.002 

0.018 

0.017 

0.004 
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Notice that the least-squares procedure gives 56 percent of the total weight 
to V Q and only 42 percent is given to V^, the vector providing the most 
information about the model. 

The latent root procedure gives zero weight to Vg and 96 percent of the 
total weight to . The least squares estimates of the parameters of model 
y = X0 + e are given in the first row of table I with the true values of the 
parameters in the third row. The matrix A'A, the extended correlation matrix, 
is given in table II. The eigenvectors of A'A-are given in table III. Note 
that the four estimates of parameters involved in the near singularity are 
moderately large negative values. The fact that these estimates are similar 
despite the differences in their true values of the parameters is indicative 
of the effect of the near singularity. Using the modified least-squares 
procedure by computing a linear combination of all the eigenvectors except 
Vg yields the estimates in the second row of table I. The absolute values 
of the first elements of the eigenvectors and the eigenvalues are given in 
table IV- 


TABLE I..- LEAST SQUARES AND MODIFIED LEAST SQUARES 



b JL ■ 


• ^ 


• H 


Si 

LS 

-6.0378 

-8.472 

-10.1435 

-11.7271 

4.0967 

9.4506 

1.2762 

LR 

2.5447 

-0.3982 

0.2416 

-0.7348 

4.2125 

9.4914 

1.3575 

True values 2.000 • 

1.000 

0.2000 

-2.000 

3.000 

10. COO 

1.000 


TABLE II.- 

A'A, THE ‘EXTENDED CORRELATION MATRIX- 



. £ il 

h 

^3 

h 

^5 

^6 



1.000 0.252 

-0.099 

0.217 

-0.339 

0.364 

0.811 



1.000 

•0.052 

-0.343 

-0.498 

0.417 

-0.192 




1.000 

-0.432 

-0.371 

0.485 

-0.317 





1.000 

-0.355 

-0.505 

0.494 






1.000 

-0.215 

-0.087 







1.000 

-0.123 








1.000 
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TABLE III.- EIGENVECTORS OF A'A 


i 

hi 

hi 

fii f3i 

S . 
3L 

s 

hi 

hi 

6 

0.1653 

-0.3300 

-0.4471 0.5165 1 

0.1009 

-0.4370 

0.4427 

5 

.6006 

.3444 

.0925 .1436 

-.4785 

.3294 

.3925 

4 

.3406 

-.1134 

-.1886 -.4556 

.6518 

.3303 

.3068 

3 

.0388 

-.6944 

.6712 .0937 

-.0417 

.1427 

.1869 

2 

.0713 

.2344 

.3550 -.4505 

-.0128 

-.7033 

.3410 

1 

.6987 

-.1694 

.0242 -.0091 

.0453 

-.2766 

-.6355 

0 

.0339 

.4402 

.4229 .5416 

.5763 

-.0071 

-.0276 


TABLE IV.- INDEXES FOR STANDARDIZED PREDICTION 

EQUATIONS 


r- 


0 

1.2 3 

4 

5 

6 

V 


0.0010 

0.0287 0.3115' 0.9178 

1.1150 

2.1816 

2.444 

l s 0j 

h 

.0399 

.6987 .0713 .0388 

.3406 

.6006 

.1653 

A jt : 


19.1496 

14.3352 .1347 .0249 

.1798 

.1620 

.0398 

White (ref. 

61) studies the problem of deciding whether 

an eigenvector of 

A'A 

should 

be removed 

from the analysis. Upon first consideration 

, it 

appears that the problem centers on deciding when |S Qj . | 

and £. are 

J 

small 


enough to indicate the presence of nonpredictive multicollinearities. White 
proposes that a more crucial consideration is the .orientation of the true 
coefficient vector, 8, in the p-dimensional subspace spanned by the 
eigenvectors of X'X, the correlation matrix. His proposal is plausible if 
we are willing to accept that 

V ? % Vi ; j = °» ■>’ •••» k - 1 

where 

V.; j = 1 , 2, •**, p = the eigenvectors of X'X 
J 

k - 1 = the number of nonpredictive multicollinearities 
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5. APPLICATIONS 


The author used the procedure of combining some of the good features of the 
several biased techniques (section 4) and the unbiasness property of the 
ordinary least-squares estimator, using weather data for Oklahoma and Texas. 
In addition, trend data were available for Oklahoma. The weather variables 
are the following: 


Variabl e 


10 


'll 


12 

( 13 

<14 

<15 


1 6 


'17 


Name 


Type 


Precipitation for current year 


Precipitation for previous year 


Mean temperature 
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Variable 

Name 

I 

X 18 

January \ 


X 19 

February 


X 

JV> 

o 

March 

/ 

> Percent evapotranspiration 

X 21 

April 

i 

X 22 

May 


X 23 

June 1 

! 

X 24 

Trend 

Trend 

y 

Yield 

Yield 


The model postulated Is 

24 

y = f* 0 + Y, B,X, + e 
i=l 

and if the' matrix is standardized, then 

y = XB + e 

The 45 data points consist of weather information from 1932 to 1976, inclusive. 

The first task is to reduce the number of variables in a meaningful way. 

The all -possible regressions procedure (see section 3.1) was used to analyze 
all possible subsets of variables. The optimum number of variables that 
should be kept in the model was determined by the adjusted R (upper bound) 
and the Mallows' (lower bound). (See sections 3.2 and 3.3.) 

The following results were obtained from the all-possible regressions 
approach. 

OKLAHOMA 

? 

Ten variables were selected by using the adjusted R as a criterion of 
goodness of fit, X-j , X^, Xg, Xg, Xg, X-j-j, ^14’ ^17’ ^20 J ^24" 

' 5-2 


(A 



The Mallows' C criterion selected- seven variables: X 0 , Xr, X,-, X,., X-,-,, 

p o 0 0 0 14 1/ 

X 2 Q> and X 24 . The highest R possible (using all 24 variables) is 91.65. 

Ten Variable Results 

Two small eigenvalues of the extended correlation matrix, Jl-j = 0.004389 and 
&2 = 0.073157, were obtained, indicating two multicollinearities. Their 
respective values of [ j and [ Sq^ I are 0.091927 and 0.71778, indicating 
that the first eigenvectors correspond to a nonpredictive near singularity. 

In fact, the second eigenvector provides the most information about yield. 

This situation is very similar to the example given in section 4. The 
second eigenvector is 

= [-0.04957, -0.16844, 0.29297, 0.21722, -0.07449, 0.06980, 

0.05778, 0.17312, 0.2233, -0.47684, 0.71778] 
so the following equation holds. 

-0.04957X-,' - 0.16844X 2 + 0.29297X 3 + 0.21722X 4 - 0.07449X 5 

+ 0.06980X g + 0.05778X y + 0.17312X g + 0.2233X g - 0.47684X-, 0 

+ 0.71778Y = 0.073157 

Notice that yield, y, is heavily involved in the multi col linearity. 

2 

The computed value of R is 89.22 and the value of the determinant of the 
correlation matrix, |R|, is 0.003211. 1 / f R | is called the generalized 
variance (ref. 56). The size' of |Rj indicates possible instability in the 
estimates of the parameters. 

Thus far, latent root regression techniques have been used to determine the 
source and type of the multicollinearities present in the data; now the 
problem is to remove the multi coll inearity by deleting one or more variables. 
Looking at the first eigenvector (which corresponds to the nonpredictive 
near singularity), observe that two components are larger than the other; i.e. 
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Vj = [0.019389, 0.004192, -0.01999, -0.025515, 0.018872, 

-0.004851, -0.716423, -0.009117, 0.687389, 0.062269, -0.091927] 

Again, the interpretation is similar as the equation above, but here the 
variable yield is not involved in the multi col linearity; only variables 
X-j^ and X 2Q are involved in the nonpredictive near singularity. The 
weights given to X^ and X 20 are -0.716423 and 0.687389, which are much 
larger than the other components in V-j . 

We delete X^q rather than X^ because X^g is more correlated with yield 
than X-|^. Alternatively, the variable that least decreases could be 
deleted. 

Nine Variable- Results 

With variable X 2 q' deleted, we have only one small eigenvalue of A'A, 

&1 = 0.066375. Since | S Q -j | = 0.70785, the eigenvector V-j corresponds to a 
predictive near singularity. . This eigenvector is 

V-j = [0.05068, 0.161875, -0.28525, -0.21834, 0.073801 , -0.069854, 

-0.291657, -0.169026, -0.466021, -0.707851] 

The interpretation of this eigenvector is as before. The computed value of 

p 

R is 87.33, so the net loss in goodness of fit is 1.89. The value of 
J R | is 0.2967, which is much higher than the previous value of 0.003211. 
This new value of |R| is an indication of stability of the parameters 
estimated; i.e., the variances of the estimates are not too large. 

Let us see if the current results can be improved by introducing some bias 

to the estimator (generalized ridge procedure). The determinant of R went 

2 2 
up to 0.3518 and the estimated R is now 87.02, so a loss of 0.308 in R 

gives an improvement of 0.011. This result seems to be adequate as an 

initial start in the modeling of wheat yield. The next step should be to 

consider interaction and square terms. The resulting values of the 

parameters are: 
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b-j = 0.01005 
b 2 = 0.02819 
b g = -0.02465 
b g = -0.02182 
b g = 0.01048 
b-j-j = -0.01537 
b-, 4 = -0.60445 
b 17 = -0.51368 
b 24 = 0.26523 

The value of bg, the intercept, needs to be calculated by using the sample 
means of the X's and Y. 

TEXAS 

2 

Ten variables were selected by using the adjusted R as a criterion of good- 
ness of fit. They are: Xg, ^10* ^11 5 ^12* ^13’ ^14* ^16 s ^18 3 ^19 3 
X 22 . Trend, X 24 , is not available. Mallows' criterion selected five 

variables: Xg, X,-, X 7 , X^, and X^ g . If all 23 variables are used, the 

computed value of is 53.458. 

Ten Variable Results 

The A'A matrix has three small eigenvalues: = 0.000332, & 2 = 0.005177, 

and 5.^ = 0.060468, so we have three near singularities. The respective 
values of JS Qi { are: | S Q1 | = 0.004814, |S Q2 | = 0.025483, and |S Q3 1 = 0.082601, 

indicating that we have three nonpredictive near singularities. The 
determinant of R, |R[, is 0.00000054 and the computed value of R is 49.45. 

The vector V-j is 

V-j = [0.005817, -0.008772, 0.002736, -0.012504, 0.014695, 

-0.00475, -0.706582, 0.015953, -0.020851, 

0.706764, 0.004814] 

By observing the eigenvector V-j , we see that the weights for X-jg and X 22 , 
-0.706582 are 0.706764, are clearly larger than the other weights. Therefore, 


i0 
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X^g and X 2Q are candidates for deletion. We delete X-jg because the correlation 
of X-| 6 and yield is lower than the correlation of X 2Q and yield. 

Nine Variable Results 

Delete X-jg, the A'A has two small eigenvalues: £-| = 0.005177 and = 0.060408. 

Their respective values of |Sq. | are: I s q-] i = 0.005177 and jSg 2 | = 0.060408. 

the value of | R j is 0.00079055 and the value of R 2 is 47.714. The candidates 

for deletion were X-^ and X^; X-jg was deleted by the same reasons as before. 

Eight Variable Results 

Delete X-jg, and A'A has only one small eigenvalue, = 0.06354, which 
corresponds to one nonpredictive near singularity. J R [ is 0.0714075 and R 2 
is 43.88. The candidates for deletion are X 12 and X^. Variable X-j 2 was 
deleted. 

Seven Variable Results 

p 

Delete X-j 2 , and A'A has no small eigenvalues, 1 R j is 0.5797, and R is 42.72. 
This value of | R j is excellent and the loss in R 2 has not been too great. 

In this case, there is no need to enter the biased estimation procedure; 
therefore, the unbiased least-squares estimator is used to estimate the 
parameters of the yield model as 

b g = -0.05516 
b 1Q = 0.07096 
b n = -0.06655 
b ]3 = -0.57036 
b^ = -0.70483 
b 18 = -0.43672 
b 22 » -0.1944 
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6. -CONCLUSIONS AND RECOMMENDATIONS 

It was observed that OLS is not adequate as an estimation procedure when the 
independent or regressor variables are involved in multi col linearities. 

This was shown to cause the presence of small eigenvalues of the extended 
correlation matrix A'A. -It has been demonstrated that the biased estimation 
techniques and the all -possible subset regression can help in finding a 
suitable model for predicting yield. 

Latent root regression is an excellent tool that allows us to find how many 
predictive an nonpredictive multicollinearities we have, and it also tells - us 
exactly what variables are involved in the multicollinearities. Thus, ,we can 
decide what variables to drop from the model to remove the multicollinearities 
and hence obtain estimates with small variances. 

.It is recommended that the procedures discussed .in this memorandum be made 
available in the Earth ‘Observations Division Laboratory for Applications of 
Remote Sensing classification system. The author has made available to 
NASA/JSC personnel the necessary programs to implement these techniques. 

The results presented in this memorandum are the initial attempts to find a 
yield model for wheat. Additional research should be conducted to estimate 
interaction terms and other ways of measuring trend. 
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